Breast cancer is one of the most dreaded and deadly cancer diagnoses that a woman can receive. For women in the U.S., breast cancer death rates are higher than those for any other cancer besides lung cancer. Many institutions have dedicated years of research to improving the survival chances of breast cancer patients, and there has been a measure of improvement in new incidence rates since 2000. Treatment advances, earlier detection through screening, and increased awareness are all key factors in surviving breast cancer, and the emergence of machine learning in medical research is an important step in detecting and predicting malignant tumors.
It is estimated that over 252,710 US women are diagnosed with breast cancer each year. About 1 in 8 US women will develop invasive breast cancer over the course of their lifetime. The most advanced form, Stage 4 breast cancer, is also called metastatic breast cancer. Metastasis happens when cancer cells migrate from the breast to elsewhere in the body and trigger cancerous growth there; it is terminal, meaning there is no cure. More than 40,000 US women a year die from metastatic breast cancer, and that number has not changed since 1970. Research enabling earlier detection of malignancy is imperative to the survival of women diagnosed with breast cancer.
Tests such as MRI, mammogram, ultrasound and biopsy are commonly used to diagnose breast cancer. Dr. William H. Wolberg, a physician at the University of Wisconsin Hospital at Madison, created a dataset using fine needle aspiration (FNA) biopsies to collect samples from patients with solid breast masses and a computer vision approach known as “snakes” to compute values for each of ten characteristics of each nucleus, measuring size, shape and texture. The mean, standard error and extreme values of these features are computed, resulting in a total of 30 nuclear features for each sample.
Using this dataset we can examine the observations from the biopsies and investigate whether any variable, or any combination of variables, predicts a malignant or benign diagnosis.
Describe how the data were collected.
“Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.”
The features in these datasets characterise cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses.
Each case represents an individual sample or observation of tissue taken from a biopsy of a breast mass. There are 569 observations in the given data set.
What are the two variables you will be studying? State the type of each variable.
The response variable is the diagnosis, a qualitative (binary categorical) variable that is either benign or malignant.
There are 30 independent variables, all of which are quantitative. An additional variable computed from radius_mean, called radius_mean_size, is qualitative and describes the size of the mass as being in the bottom half or top half of the size range. The variables are all aspects of the tissue samples and include the mean, standard error and worst case for each feature.
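Ten real-valued features are computed for each cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension.
All feature values are recorded with four significant digits.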
Missing attribute values: none
Class distribution: 357 benign, 212 malignant
What is the type of study, observational or an experiment? Explain how you’ve arrived at your conclusion using information on the sampling and/or experimental design.
This study is an observational study of the biopsied breast tissue mass.
Identify the population of interest, and whether the findings from this analysis can be generalized to that population, or, if not, a subsection of that population. Explain why or why not. Also discuss any potential sources of bias that might prevent generalizability.
Can these data be used to establish causal links between the variables of interest? Explain why or why not.
Perform relevant descriptive statistics, including summary statistics and visualization of the data. Also address what the exploratory data analysis suggests about your research question.
There are no missing values.
Boxplots of the 10 mean variables vs. diagnosis:
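The plots themselves are not reproduced in this text. As a rough sketch of the descriptive checks in this section, the code below uses simulated stand-in data: only the 357/212 class split comes from the report, while the column values, the data frame name `sim` and the two features shown are assumptions made for illustration.

```r
# Simulated stand-in for the biopsy data (values are invented, not the real data).
set.seed(2024)
sim <- data.frame(
  diagnosis    = factor(rep(c("B", "M"), times = c(357, 212))),
  radius_mean  = c(rnorm(357, mean = 12, sd = 2), rnorm(212, mean = 17, sd = 3)),
  texture_mean = c(rnorm(357, mean = 18, sd = 4), rnorm(212, mean = 22, sd = 4))
)

# Missing values per column and the class distribution.
n_missing    <- colSums(is.na(sim))
class_counts <- table(sim$diagnosis)

# Mean of each feature within each diagnosis group.
group_means <- aggregate(cbind(radius_mean, texture_mean) ~ diagnosis,
                         data = sim, FUN = mean)

# Box statistics by diagnosis. plot = FALSE returns the numbers without
# drawing; drop that argument to draw the boxplot itself.
bp <- boxplot(radius_mean ~ diagnosis, data = sim, plot = FALSE)
bp$stats  # rows: lower whisker, Q1, median, Q3, upper whisker; columns: B, M

n_missing
class_counts
group_means
```

A feature whose boxes barely overlap between the two groups is a natural candidate predictor for the logistic regression that follows.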
Logistic regression is used for modeling when there is a categorical response variable with two levels. Logistic regression is a generalized linear model (GLM).
Conditions for Logistic Regression:
H0: No variable (individually or in combination) is a good predictor of a benign or malignant diagnosis. H1: There is a specific variable or combination of variables that is a good predictor of a benign or malignant diagnosis.
Generalized linear models (GLMs) are an extension of linear models to model non-normal response variables. Logistic regression is for binary response variables, where there are two possible outcomes.
##
## Call:
## glm(formula = diagnosis ~ radius_mean + texture_mean + area_mean +
## smoothness_mean + compactness_mean + concavity_mean + concave_points_mean +
## symmetry_mean + radius_se + perimeter_se + area_se + smoothness_se +
## compactness_se + concavity_se + concave_points_se + symmetry_se +
## fractal_dimension_se + texture_worst + perimeter_worst +
## smoothness_worst + compactness_worst + concavity_worst +
## concave_points_worst + symmetry_worst + fractal_dimension_worst,
## family = binomial, data = bc_data_no_id, control = list(maxit = 100))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.286e-05 -2.100e-08 -2.100e-08 2.100e-08 3.609e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.186e+03 4.933e+06 0.002 0.999
## radius_mean -6.878e+03 1.356e+06 -0.005 0.996
## texture_mean 1.564e+02 4.047e+04 0.004 0.997
## area_mean 6.126e+01 1.215e+04 0.005 0.996
## smoothness_mean 7.278e+04 4.545e+07 0.002 0.999
## compactness_mean -6.855e+04 1.520e+07 -0.005 0.996
## concavity_mean 4.981e+04 1.442e+07 0.003 0.997
## concave_points_mean 3.723e+04 4.929e+07 0.001 0.999
## symmetry_mean -3.498e+04 8.236e+06 -0.004 0.997
## radius_se 1.335e+04 1.225e+07 0.001 0.999
## perimeter_se -2.412e+03 7.015e+05 -0.003 0.997
## area_se 1.158e+02 9.156e+04 0.001 0.999
## smoothness_se -9.419e+04 9.515e+07 -0.001 0.999
## compactness_se 1.140e+05 3.630e+07 0.003 0.997
## concavity_se -1.182e+05 2.703e+07 -0.004 0.997
## concave_points_se 4.666e+05 1.142e+08 0.004 0.997
## symmetry_se -1.452e+05 6.319e+07 -0.002 0.998
## fractal_dimension_se -1.027e+06 4.784e+08 -0.002 0.998
## texture_worst 1.156e+02 2.989e+04 0.004 0.997
## perimeter_worst 2.446e+02 5.477e+04 0.004 0.996
## smoothness_worst -1.681e+04 2.914e+07 -0.001 1.000
## compactness_worst -1.295e+04 6.999e+06 -0.002 0.999
## concavity_worst 8.347e+03 3.462e+06 0.002 0.998
## concave_points_worst 1.663e+04 1.390e+07 0.001 0.999
## symmetry_worst 3.567e+04 1.096e+07 0.003 0.997
## fractal_dimension_worst 1.153e+05 6.143e+07 0.002 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7.5144e+02 on 568 degrees of freedom
## Residual deviance: 1.0168e-08 on 543 degrees of freedom
## AIC: 52
##
## Number of Fisher Scoring iterations: 45
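The summary above shows a residual deviance of essentially zero, very large standard errors and p-values close to 1. That pattern is typical of a logistic regression in which the predictors separate the two classes perfectly, although the report does not confirm that this is the cause here. The following is a minimal sketch of the same symptom on toy data; the data frame `toy` and its values are assumptions made for illustration.

```r
# Toy data (invented): y is 0 for x <= 5 and 1 for x >= 6, so x separates
# the two classes perfectly.
toy <- data.frame(x = 1:10, y = rep(c(0, 1), each = 5))

# glm() warns that fitted probabilities are numerically 0 or 1; the warning
# is suppressed here only to keep the output short.
sep_fit <- suppressWarnings(
  glm(y ~ x, family = binomial, data = toy, control = list(maxit = 100))
)

deviance(sep_fit)       # residual deviance, essentially zero
coef(summary(sep_fit))  # very large standard errors, p-values near 1
```

When this happens the individual coefficients and their p-values are not interpretable, which is one motivation for reducing the number of predictors before drawing conclusions.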
Plan: run the genetic algorithm several times and, in each resulting model, look at which of the selected variables have a high standard error and a high p-value. Pick out the best variables from the resulting models, then rerun them in a regular or best glm (ideally a consistent set of variables will appear every time the algorithm is run).
With 30 predictors there are 2^30 possible model combinations, so an exhaustive search is computationally infeasible. Model selection must also take into account the high correlation between variables, which means that many different combinations would make good models.
The deviance residual is useful for determining whether individual points are not well fit by the model. The deviance residual for the ith observation is the signed square root of the contribution of the ith case to the model deviance, DEV.
In standard linear models, we estimate the parameters by minimizing the sum of the squared residuals, which (under normally distributed errors) is equivalent to finding the parameters that maximize the likelihood. In a GLM we also fit parameters by maximizing the likelihood, which is equivalent to finding the parameter values that minimize the deviance.
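The two statements above can be checked numerically. The sketch below, on simulated 0/1 data (the data and object names are assumptions, not the biopsy data), computes the deviance residuals by hand and confirms that their squares sum to the model deviance, which for a 0/1 response equals -2 times the log-likelihood.

```r
# Simulated binary response with one predictor (illustrative only).
set.seed(7)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + 1.2 * x))

fit <- glm(y ~ x, family = binomial)
mu  <- fitted(fit)  # fitted probabilities

# Contribution of each case to the log-likelihood, then the signed square root.
loglik_i  <- y * log(mu) + (1 - y) * log(1 - mu)
dev_resid <- sign(y - mu) * sqrt(-2 * loglik_i)

# Each comparison should return TRUE.
all.equal(unname(dev_resid), unname(residuals(fit, type = "deviance")))
all.equal(sum(dev_resid^2), deviance(fit))
all.equal(deviance(fit), -2 * as.numeric(logLik(fit)))
```

Because the deviance is a constant minus twice the log-likelihood, maximizing the likelihood and minimizing the deviance pick out the same parameter values.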
Often we have variables that are highly correlated and therefore redundant. By eliminating highly correlated features we can avoid a predictive bias for the information contained in these features.
Correlations between all features are calculated and visualised with the corrplot package. I will consider removing all features with a correlation higher than 0.7, keeping the feature with the lower mean.
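A minimal base-R sketch of such a correlation filter follows. It assumes that "lower mean" refers to the lower mean absolute correlation with the other features, which is an interpretation rather than something the report states. The helper `drop_correlated` and the simulated columns are hypothetical and stand in for the real features.

```r
# Simulated features (invented): perimeter and area are built from radius,
# so those three are strongly correlated; the other two are independent.
set.seed(1)
n <- 300
radius <- rnorm(n, mean = 14, sd = 3.5)
sim <- data.frame(
  radius_mean     = radius,
  perimeter_mean  = 6.5 * radius + rnorm(n, sd = 1),
  area_mean       = pi * radius^2 + rnorm(n, sd = 20),
  texture_mean    = rnorm(n, mean = 19, sd = 4),
  smoothness_mean = rnorm(n, mean = 0.1, sd = 0.014)
)

# Hypothetical helper: repeatedly take the most correlated remaining pair and
# drop the member with the larger mean absolute correlation, until no pair
# exceeds the cutoff.
drop_correlated <- function(df, cutoff = 0.7) {
  cm <- abs(cor(df))
  diag(cm) <- 0
  dropped <- character(0)
  repeat {
    keep <- setdiff(colnames(cm), dropped)
    sub  <- cm[keep, keep, drop = FALSE]
    if (length(keep) < 2 || max(sub) <= cutoff) break
    idx  <- which(sub == max(sub), arr.ind = TRUE)[1, ]
    pair <- keep[idx]
    mean_abs <- rowMeans(sub[pair, , drop = FALSE])
    dropped  <- c(dropped, pair[which.max(mean_abs)])
  }
  dropped
}

dropped <- drop_correlated(sim, cutoff = 0.7)
kept    <- setdiff(names(sim), dropped)
dropped
kept
```

In this simulated example only one of the three size-related features survives, which mirrors the redundancy among radius, perimeter and area that one would expect in the real data.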
The Akaike information criterion (AIC) (Akaike, 1974) is a technique based on in-sample fit that estimates how well a model is likely to predict future values.
A good model is the one that has the minimum AIC among all the candidate models. For example, in time series forecasting the AIC can be used to select between additive and multiplicative Holt-Winters models.
The Bayesian information criterion (BIC) (Stone, 1979) is another criterion for model selection that measures the trade-off between model fit and model complexity. A lower AIC or BIC value indicates a better fit.
The Akaike information criterion (AIC) is an estimator of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.
AIC is an estimate of a constant plus the relative distance between the unknown true likelihood function of the data and the fitted likelihood function of the model, so that a lower AIC means a model is considered to be closer to the truth.
AIC basic principles:
A lower AIC indicates a more parsimonious model, relative to a model fit with a higher AIC.
It is a relative measure of model parsimony, so it only has meaning if we compare the AIC for alternate hypotheses (= different models of the data).
The comparisons are only valid for models that are fit to the same response data (i.e. the same values of y).
You shouldn’t compare too many models with the AIC. You will run into the same problems with multiple model comparison as you would with p-values, in that you might by chance find a model with the lowest AIC, that isn’t truly the most appropriate model.
When using the AIC you might end up with multiple models that perform similarly to each other. So you have similar evidence weights for different alternate hypotheses.
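As a concrete check of these principles, AIC is 2k minus twice the log-likelihood, where k is the number of estimated coefficients, and BIC replaces the 2 with log(n). For a 0/1 response, minus twice the log-likelihood equals the residual deviance, which agrees with the two glm summaries in this report: 1.0168e-08 + 2 × 26 ≈ 52 and 39.971 + 2 × 14 = 67.971. The sketch below reproduces R's `AIC()` and `BIC()` by hand on simulated data; the data and the helper `ic_by_hand` are assumptions for illustration.

```r
# Simulated data (invented): only x1 is related to the response.
set.seed(11)
n  <- 250
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.3 + 1.0 * x1))

small <- glm(y ~ x1, family = binomial)
large <- glm(y ~ x1 + x2, family = binomial)

# Hypothetical helper: information criteria computed from the log-likelihood.
ic_by_hand <- function(fit) {
  ll <- as.numeric(logLik(fit))
  k  <- length(coef(fit))
  c(AIC = 2 * k - 2 * ll,
    BIC = log(nobs(fit)) * k - 2 * ll)
}

# Both models are fit to the same y, so their criteria are comparable.
rbind(small = ic_by_hand(small), large = ic_by_hand(large))
```

The extra predictor in `large` can only lower the deviance, so the comparison hinges on whether that reduction outweighs the penalty of 2 (AIC) or log(n) (BIC) for the additional coefficient.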
Variable | AIC
------------------------ | ----------------------
radius_mean | 334.010844
texture_mean | 650.5191272
perimeter_mean | 308.4843935
area_mean | 329.6565111
smoothness_mean | 677.9484566
compactness_mean | 512.791919
concavity_mean | 387.2272159
concave_points_mean | 262.9234074
symmetry_mean | 690.7961695
fractal_dimension_mean | 755.3459406
radius_se | 484.6467303
texture_se | 755.4006776
perimeter_se | 476.8298438
area_se | 363.5025918
smoothness_se | 752.7882659
compactness_se | 705.7747342
concavity_se | 711.0863343
concave_points_se | 650.0070385
symmetry_se | 755.4157412
fractal_dimension_se | 752.060941
radius_worst | 233.108517
texture_worst | 626.0682259
perimeter_worst | 213.4799408
area_worst | 234.6393233
smoothness_worst | 645.4249535
compactness_worst | 509.5529504
concavity_worst | 441.6976201
concave_points_worst | 254.4507681
symmetry_worst | 645.4167557
fractal_dimension_worst | 693.389445
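A table like the one above could plausibly be produced by fitting one single-predictor logistic regression per feature and recording its AIC; the report does not show the code, so the following is only a sketch under that assumption. The simulated predictors (`strong`, `weak`, `noise`) are hypothetical, and the response name `diagnosis_b` is borrowed from the glmulti output later in this report.

```r
# Simulated data (invented): three candidate predictors of differing strength.
set.seed(42)
n <- 400
sim <- data.frame(strong = rnorm(n), weak = rnorm(n), noise = rnorm(n))
sim$diagnosis_b <- rbinom(n, 1, plogis(2.5 * sim$strong + 0.4 * sim$weak))

predictors <- setdiff(names(sim), "diagnosis_b")

# Fit diagnosis_b ~ <one predictor> for each candidate and keep the AIC.
single_aic <- sapply(predictors, function(v) {
  fit <- glm(reformulate(v, response = "diagnosis_b"),
             family = binomial, data = sim)
  AIC(fit)
})

aic_table <- data.frame(Variable = predictors, AIC = unname(single_aic))
aic_table[order(aic_table$AIC), ]
```

Sorting by AIC puts the most informative single predictor first. In the table above, the lowest values belong to perimeter_worst, radius_worst and area_worst.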
A genetic algorithm is a search heuristic that is inspired by Charles Darwin’s theory of natural evolution. The genetic algorithm is a method for solving both constrained and unconstrained optimization problems that is based on natural selection, the process that drives biological evolution. The genetic algorithm repeatedly modifies a population of individual solutions.
Using a genetic algorithm to select the variables which best predict a benign or malignant outcome:
##
## Call:
## fitfunc(formula = as.formula(x), family = ..1, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.95257 -0.00179 -0.00006 0.00000 2.54664
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.599e+01 1.228e+01 -2.929 0.003397 **
## radius_mean -4.032e+00 1.410e+00 -2.859 0.004245 **
## compactness_mean -1.348e+02 5.485e+01 -2.457 0.014004 *
## concavity_mean 9.616e+01 4.563e+01 2.108 0.035072 *
## symmetry_mean -5.504e+01 3.645e+01 -1.510 0.131095
## concave_points_mean 1.502e+02 9.259e+01 1.622 0.104811
## radius_se 1.559e+01 4.911e+00 3.175 0.001496 **
## concave_points_se 9.878e+02 3.202e+02 3.085 0.002033 **
## concavity_se -1.311e+02 5.620e+01 -2.333 0.019666 *
## fractal_dimension_se -1.733e+03 6.309e+02 -2.747 0.006012 **
## texture_worst 5.327e-01 1.404e-01 3.793 0.000149 ***
## area_worst 5.370e-02 1.569e-02 3.423 0.000620 ***
## symmetry_worst 4.354e+01 1.949e+01 2.234 0.025453 *
## fractal_dimension_worst 2.988e+02 1.094e+02 2.732 0.006295 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 751.440 on 568 degrees of freedom
## Residual deviance: 39.971 on 555 degrees of freedom
## AIC: 67.971
##
## Number of Fisher Scoring iterations: 12
## glmulti.analysis
## Method: g / Fitting: glm / IC used: aic
## Level: 1 / Marginality: FALSE
## From 20 models:
## Best IC: 67.9708389310763
## Best model:
## [1] "diagnosis_b ~ 1 + radius_mean + compactness_mean + concavity_mean + "
## [2] " symmetry_mean + concave_points_mean + radius_se + concave_points_se + "
## [3] " concavity_se + fractal_dimension_se + texture_worst + area_worst + "
## [4] " symmetry_worst + fractal_dimension_worst"
## Evidence weight: 0.174035220873474
## Worst IC: 78.2587663336834
## 7 models within 2 IC units.
## 10 models to reach 95% of evidence weight.
## Convergence after 340 generations.
## Time elapsed: 1.03255878289541 minutes.
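The output above comes from glmulti, whose call is not reproduced here. To illustrate the idea only, the following is a minimal base-R sketch of a genetic algorithm for variable selection with AIC as the fitness. It is not glmulti's implementation; the simulated data, the function names `subset_aic` and `run_ga`, and all tuning values are assumptions.

```r
# Simulated data (invented): six candidate predictors, only x1 and x2 matter.
set.seed(123)
n <- 300
p <- 6
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("x", seq_len(p))
sim <- cbind(X, y = rbinom(n, 1, plogis(1.5 * X$x1 - 1.0 * X$x2)))

# Fitness: AIC of the logistic model using the predictors flagged TRUE.
subset_aic <- function(mask) {
  rhs <- if (any(mask)) paste(names(X)[mask], collapse = " + ") else "1"
  AIC(glm(as.formula(paste("y ~", rhs)), family = binomial, data = sim))
}

run_ga <- function(pop_size = 20, generations = 15, mutation_rate = 0.1) {
  # Each row is one candidate model: a logical mask over the predictors.
  pop <- matrix(runif(pop_size * p) < 0.5, pop_size, p)
  n_elite <- pop_size %/% 2
  history <- numeric(generations)
  for (g in seq_len(generations)) {
    fitness <- apply(pop, 1, subset_aic)
    ord <- order(fitness)
    history[g] <- fitness[ord[1]]
    # Selection: the better half survives unchanged (elitism).
    elite <- pop[ord[seq_len(n_elite)], , drop = FALSE]
    # Crossover and mutation: each child mixes two surviving parents.
    children <- t(replicate(pop_size - n_elite, {
      parents <- elite[sample(n_elite, 2), , drop = FALSE]
      cut <- sample(p - 1, 1)
      child <- c(parents[1, 1:cut], parents[2, (cut + 1):p])
      xor(child, runif(p) < mutation_rate)  # flip a few bits at random
    }))
    pop <- rbind(elite, children)
  }
  fitness <- apply(pop, 1, subset_aic)
  best <- which.min(fitness)
  list(mask = pop[best, ], aic = fitness[best], history = history)
}

ga <- run_ga()
names(X)[ga$mask]  # predictors in the best model found
ga$aic
```

Because the search is stochastic, different seeds can return different models with similar AIC, which is consistent with the glmulti summary above reporting 7 models within 2 IC units of the best.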
Write a brief summary of your findings without repeating your statements from earlier. Also include a discussion of what you have learned about your research question and the data you collected. You may also want to include ideas for possible future research.
This database is also available through the UW CS ftp server: ftp ftp.cs.wisc.edu cd math-prog/cpo-dataset/machine-learn/WDBC/
Also can be found on UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Creators:
Dr. William H. Wolberg, General Surgery Dept. University of Wisconsin, Clinical Sciences Center Madison, WI 53792 wolberg ‘@’ eagle.surgery.wisc.edu
W. Nick Street, Computer Sciences Dept. University of Wisconsin, 1210 West Dayton St., Madison, WI 53706 street ‘@’ cs.wisc.edu 608-262-6619
Olvi L. Mangasarian, Computer Sciences Dept. University of Wisconsin, 1210 West Dayton St., Madison, WI 53706 olvi ‘@’ cs.wisc.edu
W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.
O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995.
W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171.
W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17 No. 2, pages 77-87, April 1995.
W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery 1995;130:511-516.
W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computer-derived nuclear features distinguish malignant from benign breast cytology. Human Pathology, 26:792–796, 1995.